The grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes of geometric objects.
* Data
* Layers, made up of geometric elements and statistical transformations.
* Scales
* Coordinate system
* Facet: how to break up the data into subsets and display those subsets as small multiples.
* Theme: controls the finer points of display, like font size and background colour.
* ggplot2 can only create static graphics.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#library(magrittr)
mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29
## 3 audi a4 2.0 2008 4 manual(m6) f 20 31
## 4 audi a4 2.0 2008 4 auto(av) f 21 30
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26
## 7 audi a4 3.1 2008 6 auto(av) f 18 27
## 8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26
## 9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25
## 10 audi a4 quattro 2.0 2008 4 manual(m6) 4 20 28
## # ... with 224 more rows, and 2 more variables: fl <chr>, class <chr>
cty and hwy record miles per gallon (mpg) for city and highway driving.
displ is the engine displacement in litres.
drv is the drive train: front wheel (f), rear wheel (r) or four wheel (4).
model is the model of car. There are 38 models, selected because they had a new edition every year between 1999 and 2008.
class (not shown) is a categorical variable describing the “type” of car: two seater, SUV, compact, etc.
ggplot(mpg,aes(x=displ, y=hwy))+
geom_point()
data: mpg
aesthetic mapping: engine size mapped to x position, fuel economy to y position.
layer: points
Data and aesthetic mappings are supplied in ggplot(); layers are then added with +.
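To make the role of + concrete, here is a minimal sketch, assuming ggplot2 is installed. The object names base and p are only illustrative. It checks that ggplot() on its own holds data and mappings but no layers, and that each + adds one.

```r
library(ggplot2)

# ggplot() supplies the data and the aesthetic mappings, but no layers yet
base <- ggplot(mpg, aes(x = displ, y = hwy))

# Adding a geom with + returns a new plot object with one more layer
p <- base + geom_point()

# Count the layers stored in each plot object
n_layers_base <- length(base$layers)
n_layers_p <- length(p$layers)
```

Because + returns a new object, base can be reused as the starting point for several different plots.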
ggplot(mpg, aes(displ, cty, colour = class)) +
geom_point()
This gives each point a unique colour corresponding to its class.
ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = "blue"))
ggplot(mpg, aes(displ, hwy)) + geom_point(colour = "blue")
In the first plot, the value “blue” is scaled to a pinkish colour, and a legend is added. In the second plot, the points are given the R colour blue.
When using aesthetics in a plot, less is usually more.
Faceting creates tables of graphics by splitting the data into subsets and displaying the same graph for each subset.
There are two types of faceting: grid and wrapped. Wrapped is the most useful, so we’ll discuss it here; grid faceting comes later.
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
facet_wrap(~class)
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth() #se = TRUE
## `geom_smooth()` using method = 'loess'
This overlays the scatterplot with a smooth curve, including an assessment of uncertainty in the form of a point-wise confidence interval shown in grey.
* method = 'loess', the default for small n, uses a smooth local regression. The wiggliness of the line is controlled by the span parameter, which ranges from 0 (exceedingly wiggly) to 1 (not so wiggly).
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(span = 0.2)
## `geom_smooth()` using method = 'loess'
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(span = 1)
## `geom_smooth()` using method = 'loess'
method = 'gam' fits a generalised additive model provided by the mgcv package.
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(method = "gam", formula = y ~ s(x))
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(method = "lm")
method = "rlm" works like lm(), but uses a robust fitting algorithm so that outliers don’t affect the fit as much.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(method = "rlm")
ggplot(mpg, aes(drv, hwy)) +
geom_point()
Jittering, geom_jitter(), adds a little random noise to the data which can help avoid overplotting.
ggplot(mpg, aes(drv, hwy)) +
geom_jitter()
Boxplots, geom_boxplot(), summarise the shape of the distribution with a handful of summary statistics.
ggplot(mpg, aes(drv, hwy)) +
geom_boxplot()
Violin plots, geom_violin(), show a compact representation of the “density” of the distribution, highlighting the areas where more points are found.
ggplot(mpg, aes(drv, hwy)) +
geom_violin()
ggplot(mpg, aes(hwy)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(mpg, aes(hwy)) + geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
You should always try many bin widths, and you may find you need multiple bin widths to tell the full story of your data.
ggplot(mpg, aes(hwy)) +
geom_freqpoly(binwidth = 2.5)
ggplot(mpg, aes(hwy)) +
geom_freqpoly(binwidth = 1)
I’m not a fan of density plots because they are harder to interpret since the underlying computations are more complex.
ggplot(mpg, aes(hwy)) +
geom_density() # binwidth is not a geom_density() parameter; use bw or adjust to control smoothing
ggplot(mpg, aes(displ, colour = drv)) +
geom_freqpoly(binwidth = 0.5)
ggplot(mpg, aes(displ, fill = drv)) +
geom_histogram(binwidth = 0.5) +
facet_wrap(~drv, ncol = 1)
Unsummarised data:
ggplot(mpg, aes(manufacturer)) +
geom_bar()
Presummarised data:
drugs <- data.frame(
drug = c("a", "b", "c"),
effect = c(4.2, 9.7, 6.1)
)
ggplot(drugs, aes(drug, effect)) +
geom_bar(stat = "identity")
ggplot(drugs, aes(drug, effect)) + geom_point()
Line plots usually have time on the x-axis, showing how a single variable has changed over time.
ggplot(economics, aes(date, unemploy / pop)) +
geom_line()
ggplot(economics, aes(date, uempmed)) +
geom_line()
Path plots show how two variables have simultaneously changed over time, with time encoded in the way that observations are connected.
ggplot(economics, aes(unemploy / pop, uempmed)) +
geom_path() +
geom_point()
Because of the many line crossings, the direction in which time flows isn’t easy to see in the first plot.
year <- function(x) as.POSIXlt(x)$year + 1900
ggplot(economics, aes(unemploy / pop, uempmed)) +
geom_path(colour = "grey50") +
geom_point(aes(colour = year(date)))
ggplot(mpg, aes(cty, hwy)) +
geom_point(alpha = 1 / 3)
ggplot(mpg, aes(cty, hwy)) +
geom_point(alpha = 1 / 3) +
xlab("city driving (mpg)") +
ylab("highway driving (mpg)")
# Remove the axis labels with NULL
ggplot(mpg, aes(cty, hwy)) +
geom_point(alpha = 1 / 3) +
xlab(NULL) +
ylab(NULL)
xlim() and ylim() modify the limits of axes:
ggplot(mpg, aes(drv, hwy)) +
geom_jitter(width = 0.25)
ggplot(mpg, aes(drv, hwy)) +
geom_jitter(width = 0.25) +
xlim("f", "r") +
ylim(20, 30)
## Warning: Removed 139 rows containing missing values (geom_point).
ggplot(mpg, aes(drv, hwy)) +
geom_jitter(width = 0.25, na.rm = TRUE) +
ylim(NA, 30)
You can suppress the associated warning with na.rm = TRUE.
p <- ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
geom_point()
Render it on screen, with print().
print(p)
Save it to disk, with ggsave()
ggsave("plot.png", width = 5, height = 5)
Briefly describe its structure with summary().
summary(p)
## data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy,
## fl, class [234x11]
## mapping: x = displ, y = hwy, colour = factor(cyl)
## faceting: <ggproto object: Class FacetNull, Facet>
## compute_layout: function
## draw_back: function
## draw_front: function
## draw_labels: function
## draw_panels: function
## finish_data: function
## init_scales: function
## map: function
## map_data: function
## params: list
## render_back: function
## render_front: function
## render_panels: function
## setup_data: function
## setup_params: function
## shrink: TRUE
## train: function
## train_positions: function
## train_scales: function
## vars: function
## super: <ggproto object: Class FacetNull, Facet>
## -----------------------------------
## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity
Save a cached copy of it to disk, with saveRDS(). This saves a complete copy of the plot object, so you can easily re-create it with readRDS().
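A minimal sketch of the save and restore round trip, assuming ggplot2 is installed. The temporary file path and the name q are only illustrative.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
  geom_point()

# Write the complete plot object to a temporary file (illustrative location)
path <- tempfile(fileext = ".rds")
saveRDS(p, path)

# Read it back; q is a full plot object that can be printed or modified again
q <- readRDS(path)
```

Unlike ggsave(), which writes a rendered image, this keeps the plot object itself, so layers can still be added to q afterwards.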
qplot(displ, hwy, data = mpg)
qplot(displ, data = mpg)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Unless otherwise specified, qplot() tries to pick a sensible geometry and statistic based on the arguments provided.
If you want to set an aesthetic to a constant, you need to use I():
qplot(displ, hwy, data = mpg, colour = "blue")
qplot(displ, hwy, data = mpg, colour = I("blue"))
Each of these geoms is two dimensional and requires both x and y aesthetics.
All of them understand colour (or color) and size aesthetics, and the filled geoms (bar, tile and polygon) also understand fill.
df <- data.frame(
x = c(3, 1, 5),
y = c(2, 4, 6),
label = c("a","b","c")
)
p <- ggplot(df, aes(x, y, label = label)) +
labs(x = NULL, y = NULL) + # Hide axis label
theme(plot.title = element_text(size = 12)) # Shrink plot title
p + geom_area() + ggtitle("area")
geom_bar(stat = "identity") makes a bar chart. We need stat = "identity" because the default stat automatically counts values (so it is essentially a 1d geom). The identity stat leaves the data unchanged. Multiple bars in the same location will be stacked on top of one another.
p + geom_bar(stat = "identity") + ggtitle("bar")
geom_line() makes a line plot. The group aesthetic determines which observations are connected.
p + geom_line() + ggtitle("line")
p + geom_path() + ggtitle("path")
geom_point() produces a scatterplot. geom_point() also understands the shape aesthetic.
p + geom_point() + ggtitle("point")
p + geom_polygon() + ggtitle("polygon")
geom_rect() is parameterised by the four corners of the rectangle, xmin, ymin, xmax and ymax. geom_tile() is exactly the same, but parameterised by the center of the rect and its size, x, y, width and height. geom_raster() is a fast special case of geom_tile() used when all the tiles are the same size.
p + geom_tile() + ggtitle("tile")
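The relationship between the two parameterisations can be sketched as follows. The data frames tiles and rects are hypothetical, and layer_data() is used only to inspect what ggplot2 computes for each layer.

```r
library(ggplot2)

# Hypothetical tiles described by centre and size
tiles <- data.frame(x = c(1, 3), y = c(1, 2), width = c(1, 2), height = c(1, 0.5))

# The same rectangles described by their corners
rects <- data.frame(
  xmin = tiles$x - tiles$width / 2,
  xmax = tiles$x + tiles$width / 2,
  ymin = tiles$y - tiles$height / 2,
  ymax = tiles$y + tiles$height / 2
)

p_tile <- ggplot(tiles, aes(x, y, width = width, height = height)) +
  geom_tile()
p_rect <- ggplot(rects, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax)) +
  geom_rect()

# Extract the data each layer hands to the drawing code
tile_built <- layer_data(p_tile)
rect_built <- layer_data(p_rect)
```

Comparing the xmin, xmax, ymin and ymax columns of tile_built and rect_built should show the same corner coordinates for both plots.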
geom_text() has the most aesthetics of any geom:
* family gives the name of a font.
df <- data.frame(x = 1, y = 3:1, family = c("sans", "serif", "mono"))
ggplot(df, aes(x, y)) +
geom_text(aes(label = family, family = family))
* fontface specifies the face: “plain” (the default), “bold” or “italic”.
df <- data.frame(x = 1, y = 3:1, face = c("plain", "bold", "italic"))
ggplot(df, aes(x, y)) +
geom_text(aes(label = face, fontface = face))
You can adjust the alignment of the text with the hjust (“left”, “center”, “right”, “inward”, “outward”) and vjust (“bottom”, “middle”, “top”, “inward”, “outward”) aesthetics.
df <- data.frame(
x = c(1, 1, 2, 2, 1.5),
y = c(1, 2, 1, 2, 1.5),
text = c(
"bottom-left", "bottom-right",
"top-left", "top-right", "center"
)
)
ggplot(df, aes(x, y)) +
geom_text(aes(label = text))
ggplot(df, aes(x, y)) +
geom_text(aes(label = text), vjust = "inward", hjust = "inward")
size controls the font size. angle specifies the rotation of the text in degrees.
The nudge_x and nudge_y parameters allow you to nudge the text a little horizontally or vertically:
df <- data.frame(trt = c("a", "b", "c"), resp = c(1.2, 3.4, 2.5))
ggplot(df, aes(resp, trt)) +
geom_point() +
geom_text(aes(label = paste0("(", resp, ")")), nudge_y = -0.25) +
xlim(1, 3.6)
If check_overlap = TRUE, overlapping labels will be automatically removed.
ggplot(mpg, aes(displ, hwy)) +
geom_text(aes(label = model)) +
xlim(1, 8)
ggplot(mpg, aes(displ, hwy)) +
geom_text(aes(label = model), check_overlap = TRUE) +
xlim(1, 8)
A variation on geom_text() is geom_label(): it draws a rounded rectangle behind the text. This makes it useful for adding labels to plots with busy backgrounds:
label <- data.frame(
waiting = c(55, 80),
eruptions = c(2, 4.3),
label = c("peak one", "peak two")
)
ggplot(faithfuld, aes(waiting, eruptions)) +
geom_tile(aes(fill = density)) +
geom_label(data = label, aes(label = label))
ggplot(mpg, aes(displ, hwy, colour = class)) +
geom_point()
ggplot(mpg, aes(displ, hwy, colour = class)) +
geom_point(show.legend = FALSE) +
directlabels::geom_dl(aes(label = class), method = "smart.grid")
geom_text() to add text descriptions or to label points
geom_rect() to highlight interesting rectangular regions of the plot.
geom_line(), geom_path() and geom_segment() to add lines.
ggplot(economics, aes(date, unemploy)) +
geom_line()
geom_vline(), geom_hline() and geom_abline() allow you to add reference lines (sometimes called rules), that span the full range of the plot.
presidential <- subset(presidential, start > economics$date[1])
ggplot(economics) +
geom_rect(
aes(xmin = start, xmax = end, fill = party),
ymin = -Inf, ymax = Inf, alpha = 0.2,
data = presidential
) +
geom_vline(
aes(xintercept = as.numeric(start)),
data = presidential,
colour = "grey50", alpha = 0.5
) +
geom_text(
aes(x = start, y = 2500, label = name),
data = presidential,
size = 3, vjust = 0, hjust = 0, nudge_x = 50
) +
geom_line(aes(date, unemploy)) +
scale_fill_manual(values = c("blue", "red"))
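The plot above uses only geom_vline(). As a small sketch of the other two rule geoms, assuming ggplot2 is installed, the following adds a horizontal, a vertical and a diagonal reference line to the mpg scatterplot. The choice of means and of the line y = x is only illustrative.

```r
library(ggplot2)

p <- ggplot(mpg, aes(cty, hwy)) +
  geom_point(alpha = 1 / 3) +
  # Horizontal rule at the mean highway mileage
  geom_hline(yintercept = mean(mpg$hwy), colour = "grey50") +
  # Vertical rule at the mean city mileage
  geom_vline(xintercept = mean(mpg$cty), colour = "grey50") +
  # Diagonal rule where highway and city mileage would be equal
  geom_abline(intercept = 0, slope = 1, linetype = "dashed")
```

Here the positions are set as constants rather than mapped with aes(), so each rule is drawn once and spans the full panel.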
yrng <- range(economics$unemploy)
xrng <- range(economics$date)
caption <- paste(strwrap("Unemployment rates in the US have
varied a lot over the years", 40),
collapse = "\n")
ggplot(economics, aes(date, unemploy)) +
geom_line() +
geom_text(
aes(x, y, label = caption),
data = data.frame(x = xrng[1], y = yrng[2], caption = caption),
hjust = 0, vjust = 1, size = 4
)
It’s easier to use the annotate() helper function which creates the data frame for you:
ggplot(economics, aes(date, unemploy)) +
geom_line() +
annotate("text", x = xrng[1], y = yrng[2], label = caption,
hjust = 0, vjust = 1, size = 4)
It’s much easier to see the subtle differences if we add a reference line.
ggplot(diamonds, aes(log10(carat), log10(price))) +
geom_bin2d() +
facet_wrap(~cut, nrow = 1)
mod_coef <- coef(lm(log10(price) ~ log10(carat), data = diamonds))
ggplot(diamonds, aes(log10(carat), log10(price))) +
geom_bin2d() +
geom_abline(intercept = mod_coef[1], slope = mod_coef[2],
colour = "white", size = 1) +
facet_wrap(~cut, nrow = 1)
An individual geom draws a distinct graphical object for each observation (row). For example, the point geom draws one point per row. A collective geom displays multiple observations with one geometric object.
By default, the group aesthetic is mapped to the interaction of all discrete variables in the plot.
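A minimal sketch of this default, using a hypothetical data frame with two discrete variables (site and treatment). It compares the default grouping with the same grouping written out by hand using interaction().

```r
library(ggplot2)

# Hypothetical data: two discrete variables and a continuous x
df <- expand.grid(x = 1:3, site = c("a", "b"), treatment = c("ctl", "trt"))
df$y <- seq_len(nrow(df))

# Default grouping: colour and linetype are mapped to discrete variables,
# so ggplot2 groups by their combinations
p_default <- ggplot(df, aes(x, y, colour = site, linetype = treatment)) +
  geom_line()

# The same grouping spelled out explicitly
p_explicit <- ggplot(df, aes(x, y, group = interaction(site, treatment))) +
  geom_line()

# Number of distinct groups ggplot2 computed for each plot
n_default <- length(unique(layer_data(p_default)$group))
n_explicit <- length(unique(layer_data(p_explicit)$group))
```

Both plots draw one line per site and treatment combination. The explicit form is what you need when no discrete variable in the plot identifies the groups.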
data(Oxboys, package = "nlme")
head(Oxboys)
## Grouped Data: height ~ age | Subject
## Subject age height Occasion
## 1 1 -1.0000 140.5 1
## 2 1 -0.7479 143.4 2
## 3 1 -0.4630 144.8 3
## 4 1 -0.1643 147.1 4
## 5 1 -0.0027 147.7 5
## 6 1 0.2466 150.2 6
In many situations, you want to be able to distinguish individual subjects, but not identify them.
ggplot(Oxboys, aes(age, height, group = Subject)) +
geom_point() +
geom_line()
Incorrect:
ggplot(Oxboys, aes(age, height)) +
geom_point() +
geom_line()
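Suppose we want to add a single smooth line showing the overall trend for all boys. With the grouping set in ggplot(), we get one smooth per boy instead: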
ggplot(Oxboys, aes(age, height, group = Subject)) +
geom_line() +
geom_smooth(method = "lm", se = FALSE)
Instead of setting the grouping aesthetic in ggplot(), where it will apply to all layers, we set it in geom_line() so it applies only to the lines.
ggplot(Oxboys, aes(age, height)) +
geom_line(aes(group = Subject)) +
geom_smooth(method = "lm", size = 2, se = FALSE)
ggplot(Oxboys, aes(Occasion, height)) +
geom_boxplot()
Now we want to overlay lines that connect each individual boy. Simply adding geom_line() does not work: the lines are drawn within each occasion, not across each subject.
ggplot(Oxboys, aes(Occasion, height)) +
geom_boxplot() +
geom_line(colour = "#3366FF", alpha = 0.5)
We need to override the grouping to say we want one line per boy:
ggplot(Oxboys, aes(Occasion, height)) +
geom_boxplot() +
geom_line(aes(group = Subject), colour = "#3366FF", alpha = 0.5)
There is one more observation than line segment, and so the aesthetic for the first observation is used for the first segment, the second observation for the second segment and so on. This means that the aesthetic for the last observation is not used:
df <- data.frame(x = 1:3, y = 1:3, colour = c(1,3,5))
ggplot(df, aes(x, y, colour = factor(colour))) +
geom_line(aes(group = 1), size = 2) +
geom_point(size = 5)
ggplot(df, aes(x, y, colour = colour)) +
geom_line(aes(group = 1), size = 2) +
geom_point(size = 5)
If you want the colour to change smoothly along the line, you can perform the linear interpolation yourself:
xgrid <- with(df, seq(min(x), max(x), length = 50))
interp <- data.frame(
x = xgrid,
y = approx(df$x, df$y, xout = xgrid)$y,
colour = approx(df$x, df$colour, xout = xgrid)$y
)
ggplot(interp, aes(x, y, colour = colour)) +
geom_line(size = 2) +
geom_point(data = df, size = 5)
For other collective geoms the question is harder: how would you colour a polygon that had a different fill colour for each point on its border?
ggplot(mpg, aes(class)) +
geom_bar()
ggplot(mpg, aes(class, fill = drv)) +
geom_bar()
If you try to map fill to a continuous variable in the same way, it doesn’t work.
To show multiple colours, we need multiple bars for each class, which we can get by overriding the grouping:
ggplot(mpg, aes(class, fill = hwy)) +
geom_bar()
ggplot(mpg, aes(class, fill = hwy, group = hwy)) +
geom_bar()
ggplot(faithfuld, aes(eruptions, waiting)) +
geom_contour(aes(z = density, colour = ..level..))
ggplot(faithfuld, aes(eruptions, waiting))+
geom_raster(aes(fill = density))
Bubble plots work better with fewer observations:
small <- faithfuld[seq(1, nrow(faithfuld), by = 10), ]
ggplot(small, aes(eruptions, waiting)) +
geom_point(aes(size = density), alpha = 1/3) +
scale_size_area()
Vector boundaries are defined by a data frame with one row for each “corner” of a geographical region like a country, state, or county. They require four variables:
* lat and long, giving the location of a point.
* group, a unique identifier for each contiguous region.
* id, the name of the region.
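To illustrate the layout described above, here is a small sketch with made-up boundaries: two square "regions" with one row per corner. The data frame regions and its values are hypothetical, and the column is named lon rather than long, following the commented code below.

```r
library(ggplot2)

# Hypothetical boundaries: two square regions, one row per corner
regions <- data.frame(
  lon = c(0, 1, 1, 0, 2, 3, 3, 2),
  lat = c(0, 0, 1, 1, 0, 0, 1, 1),
  group = rep(c(1, 2), each = 4),
  id = rep(c("west", "east"), each = 4)
)

# group keeps the corners of each region together, so each region is drawn
# as its own polygon; coord_quickmap() picks a map-like aspect ratio
p <- ggplot(regions, aes(lon, lat)) +
  geom_polygon(aes(group = group), fill = NA, colour = "grey50") +
  coord_quickmap()
```

Without the group aesthetic, all eight corners would be joined into a single polygon.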
#mi_counties <- ggplot2::map_data("county", "michigan") %>%
# select(lon = long, lat, group, id = subregion)
#head(mi_counties)
#ggplot(mi_counties, aes(lon, lat)) +
# geom_polygon(aes(group = group)) +
# coord_quickmap()
#ggplot(mi_counties, aes(lon, lat)) +
# geom_polygon(aes(group = group), fill = NA, colour = "grey50") +
# coord_quickmap()
#mi_cities <- maps::us.cities %>%
# tbl_df() %>%
# filter(country.etc == "MI") %>%
# select(-country.etc, lon = long) %>%
# arrange(desc(pop))
#mi_cities
It’s not terribly useful without a reference. You almost always combine point metadata with another layer to make it interpretable.
#ggplot(mi_cities, aes(lon, lat)) +
# geom_point(aes(size = pop)) +
# scale_size_area() +
# coord_quickmap()
#ggplot(mi_cities, aes(lon, lat)) +
# geom_polygon(aes(group = group), mi_counties, fill = NA, colour = "grey50") +
# geom_point(aes(size = pop), colour = "red") +
# scale_size_area() +
# coord_quickmap()
* Discrete x, range: geom_errorbar(), geom_linerange()
y <- c(18, 11, 16)
df <- data.frame(x = 1:3, y = y, se = c(1.2, 0.5, 1.0))
base <- ggplot(df, aes(x, y, ymin = y - se, ymax = y + se))
base + geom_errorbar()
base + geom_linerange()
* Discrete x, range & center: geom_crossbar(), geom_pointrange()
base + geom_crossbar()
base + geom_pointrange()
* Continuous x, range: geom_ribbon()
base + geom_ribbon()
* Continuous x, range & center: geom_smooth(stat = "identity")
base + geom_smooth(stat = "identity")
There are two aesthetic attributes that can be used to adjust for weights. Firstly, for simple geoms like lines and points, use the size aesthetic:
Unweighted
ggplot(midwest, aes(percwhite, percbelowpoverty)) +
geom_point()
Weight by population
ggplot(midwest, aes(percwhite, percbelowpoverty)) +
geom_point(aes(size = poptotal / 1e6)) +
scale_size_area("Population\n(millions)", breaks = c(0.5, 1, 2, 4))
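For more complicated geoms that involve a statistical transformation, we specify weights with the weight aesthetic. These weights will be passed on to the statistical summary function.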
Unweighted
ggplot(midwest, aes(percwhite, percbelowpoverty)) +
geom_point() +
geom_smooth(method = lm, size = 1)
Weighted by population
ggplot(midwest, aes(percwhite, percbelowpoverty)) +
geom_point(aes(size = poptotal / 1e6)) +
geom_smooth(aes(weight = poptotal), method = lm, size = 1) +
scale_size_area(guide = "none")
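To see what passing weights to the summary function means here, the two fits can be sketched directly with lm(). This assumes that geom_smooth(method = lm) forwards the weight aesthetic to the weights argument of lm(). The object names are only illustrative.

```r
library(ggplot2)

# Unweighted fit: every county counts equally
fit_unweighted <- lm(percbelowpoverty ~ percwhite, data = midwest)

# Weighted fit: counties with larger populations have more influence
fit_weighted <- lm(percbelowpoverty ~ percwhite, data = midwest, weights = poptotal)

# Collect the two slopes for comparison
slopes <- c(
  unweighted = unname(coef(fit_unweighted)["percwhite"]),
  weighted = unname(coef(fit_weighted)["percwhite"])
)
```

Printing slopes shows how much the fitted line changes once the model describes people rather than counties.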
The following code shows the difference this makes for a histogram of the percentage below the poverty line:
ggplot(midwest, aes(percbelowpoverty)) +
geom_histogram(binwidth = 1) +
ylab("Counties")
ggplot(midwest, aes(percbelowpoverty)) +
geom_histogram(aes(weight = poptotal / 1000), binwidth = 1) + # divide so the axis matches the "(1000s)" label
ylab("Population (1000s)")
## Warning: Ignoring unknown aesthetics: weight
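The difference between the two histograms can be sketched by hand: the unweighted version counts rows per bin, while the weighted version sums the weight per bin. The bin edges below are an assumption chosen for simplicity and need not match the ones ggplot2 picks.

```r
library(ggplot2)

# Bins one percentage point wide that cover the observed range
breaks <- seq(floor(min(midwest$percbelowpoverty)),
              ceiling(max(midwest$percbelowpoverty)), by = 1)
bin <- cut(midwest$percbelowpoverty, breaks, include.lowest = TRUE)

# Unweighted: number of counties per bin
counties <- table(bin)

# Weighted: total population per bin (empty bins get 0)
population <- tapply(midwest$poptotal, bin, sum)
population[is.na(population)] <- 0
```

The counts add up to the number of counties, and the weighted totals add up to the total population.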
diamonds
## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39
## # ... with 53,930 more rows
For 1d continuous distributions the most important geom is the histogram, geom_histogram():
ggplot(diamonds, aes(depth)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(diamonds, aes(depth)) +
geom_histogram(binwidth = 0.1) +
xlim(55, 70)
## Warning: Removed 45 rows containing non-finite values (stat_bin).
If you want to compare the distribution between groups, you have a few options:
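* Show small multiples of the histogram, facet_wrap(~ var).
* Use colour and a frequency polygon, geom_freqpoly().
* Use a “conditional density plot”, geom_histogram(position = "fill").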
ggplot(diamonds, aes(depth)) +
geom_freqpoly(aes(colour = cut), binwidth = 0.1) +
xlim(58, 68) +
theme(legend.position = "none")
## Warning: Removed 669 rows containing non-finite values (stat_bin).
## Warning: Removed 10 rows containing missing values (geom_path).
ggplot(diamonds, aes(depth)) +
geom_histogram(aes(fill = cut), binwidth = 0.1, position = "fill") +
xlim(58, 68) +
theme()
## Warning: Removed 669 rows containing non-finite values (stat_bin).
An alternative to a bin-based visualisation is a density estimate.
geom_density() places a little normal distribution at each data point and sums up all the curves.
ggplot(diamonds, aes(depth)) +
geom_density(na.rm = TRUE) +
xlim(58, 68) +
theme()
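The idea of placing a little normal distribution at each data point can be sketched in base R. The sample x and the bandwidth bw are hypothetical. The curves are averaged rather than summed so that the result integrates to one, and the result is compared with stats::density(), which is assumed here to be the kind of estimator geom_density() relies on.

```r
# Hypothetical sample and bandwidth (the standard deviation of each little normal)
x <- c(59.8, 61.5, 61.9, 62.3, 62.4, 62.8, 63.3, 65.1)
bw <- 0.5
grid <- seq(58, 68, length.out = 101)

# Place a normal density on every observation and average them at each grid point
manual <- sapply(grid, function(g) mean(dnorm(g, mean = x, sd = bw)))

# Base R's density() with a Gaussian kernel and the same bandwidth
builtin <- density(x, bw = bw, kernel = "gaussian", from = 58, to = 68, n = 101)$y

# Largest disagreement between the two estimates
max_diff <- max(abs(manual - builtin))
```

A larger bw gives wider normals and a smoother curve, which is one reason density plots need more care to interpret than histograms.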
ggplot(diamonds, aes(depth, fill = cut, colour = cut)) +
geom_density(alpha = 0.2, na.rm = TRUE) +
xlim(58, 68) +
theme()
Sometimes you want to compare many distributions, and it’s useful to have alternative options that sacrifice quality for quantity. Here are three options:
* geom_boxplot(): the box-and-whisker plot shows five summary statistics along with individual “outliers”.
ggplot(diamonds, aes(clarity, depth)) +
geom_boxplot()
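The five summary statistics behind one box can be sketched by hand, here for the SI2 clarity group. This assumes the usual convention, which is also the geom_boxplot() default, that whiskers extend to the most extreme observations within 1.5 times the interquartile range of the box. Variable names are only illustrative.

```r
library(ggplot2)

depth <- diamonds$depth[diamonds$clarity == "SI2"]

# Box: lower quartile, median, upper quartile
box <- quantile(depth, c(0.25, 0.5, 0.75), names = FALSE)
iqr <- box[3] - box[1]

# Whiskers reach the most extreme observations within 1.5 * IQR of the box
lower_whisker <- min(depth[depth >= box[1] - 1.5 * iqr])
upper_whisker <- max(depth[depth <= box[3] + 1.5 * iqr])

# Anything beyond the whiskers is drawn as an individual "outlier" point
outliers <- depth[depth < lower_whisker | depth > upper_whisker]

five_stats <- c(lower_whisker, box, upper_whisker)
```

Everything else about the distribution is discarded, which is the quality given up in exchange for fitting many groups into one plot.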
ggplot(diamonds, aes(carat, depth)) +
geom_boxplot(aes(group = cut_width(carat, 0.1))) +
xlim(NA, 2.05)
## Warning: Removed 997 rows containing non-finite values (stat_boxplot).
* geom_violin(): the violin plot is a compact version of the density plot.
ggplot(diamonds, aes(clarity, depth)) +
geom_violin()
ggplot(diamonds, aes(carat, depth)) +
geom_violin(aes(group = cut_width(carat, 0.1))) +
xlim(NA, 2.05)
## Warning: Removed 997 rows containing non-finite values (stat_ydensity).
* geom_dotplot(): draws one point for each observation.
df <- data.frame(x = rnorm(2000), y = rnorm(2000))
norm <- ggplot(df, aes(x, y)) + xlab(NULL) + ylab(NULL)
norm + geom_point()
norm + geom_point(shape = 1) # Hollow circles
norm + geom_point(shape = ".") # Pixel sized
norm + geom_point(alpha = 1 / 3)
norm + geom_point(alpha = 1 / 5)
norm + geom_point(alpha = 1 / 10)
You can randomly jitter the points to alleviate some overlaps with geom_jitter().
Bin the points and count the number in each bin, then visualise that count (the 2d generalisation of the histogram), geom_bin2d().
norm + geom_bin2d()
norm + geom_bin2d(bins = 10)
norm + geom_hex()
norm + geom_hex(bins = 10)
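The bin-and-count step behind the rectangular version can be sketched in base R. The data are simulated as above, with a seed added for reproducibility. The bin edges are an assumption chosen for simplicity and need not match the ones geom_bin2d() picks.

```r
set.seed(1)
df <- data.frame(x = rnorm(2000), y = rnorm(2000))

# Ten equal-width bins along each axis, covering the full range of the data
x_breaks <- seq(min(df$x), max(df$x), length.out = 11)
y_breaks <- seq(min(df$y), max(df$y), length.out = 11)

# Assign every point to a rectangular cell and count the points per cell
counts <- table(
  cut(df$x, x_breaks, include.lowest = TRUE),
  cut(df$y, y_breaks, include.lowest = TRUE)
)
```

The resulting table has one entry per cell, and it is these counts that the plot maps to fill colour.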
We can count the number of diamonds in each bin with geom_bar(), or summarise another variable within each bin, such as the mean price:
ggplot(diamonds, aes(color)) +
geom_bar()
ggplot(diamonds, aes(color, price)) +
geom_bar(stat = "summary_bin", fun.y = mean)
add na.rm back
ggplot(diamonds, aes(table, depth)) +
geom_bin2d(binwidth = 1) +
xlim(50, 70) +
ylim(50, 70)
## Warning: Removed 36 rows containing non-finite values (stat_bin2d).
ggplot(diamonds, aes(table, depth, z = price)) +
geom_raster(binwidth = 1, stat = "summary_2d", fun = mean) +
xlim(50, 70) +
ylim(50, 70)
## Warning: Removed 36 rows containing non-finite values (stat_summary2d).